Data - intensive file systems for Internet services : A rose
نویسندگان
چکیده
Data-intensive distributed file systems are emerging as a key component of large scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, Hadoop distributed file system (HDFS) and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classify file systems for large clusters into two disjoint categories, those for Internet services and those for high performance computing. In this paper we compare and contrast parallel file systems, developed for high performance computing, and data-intensive distributed file systems, developed for Internet services. Using PVFS as a representative for parallel file systems and HDFS as a representative for Internet services file systems, we configure a parallel file system into a data-intensive Internet services stack, Hadoop, and test performance with microbenchmarks and macrobenchmarks running on a 4,000 core Internet services cluster, Yahoo!’s M45. Once a number of configuration issues such as stripe unit sizes and application buffering sizes are dealt with, issues of replication, data layout and data-guided function shipping are found to be different, but supportable in parallel file systems. Performance of Hadoop applications storing data in an appropriately configured PVFS are comparable to those using a purpose built HDFS.
منابع مشابه
Data - intensive file systems for Internet services : A rose by any other
Data-intensive distributed file systems are emerging as a key component of large scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, Hadoop distributed file system (HDFS) and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classif...
متن کاملData-intensive File Systems for Internet Services: A Rose by Any Other Name... (CMU-PDL-08-114)
Data-intensive distributed file systems are emerging as a key component of large scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, Hadoop distributed file system (HDFS) and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classif...
متن کاملDiskReduce: RAID for Data-Intensive Scalable Computing (CMU-PDL-09-112)
Data-intensive file systems, developed for Internet services and popular in cloud computing, provide high reliability and availability by replicating data, typically three copies of everything. Alternatively high performance computing, which has comparable scale, and smaller scale enterprise storage systems get similar tolerance for multiple failures from lower overhead erasure encoding, or RAI...
متن کاملThe xDotGrid native, cross-platform, high-performance xDFS file transfer framework
In this paper we introduce and describe the highly concurrent xDFS file transfer protocol and examine its cross-platform and cross-language implementation in native code for both Linux and Windows in 32 or 64-bit multi-core processor architectures. The implemented xDFS protocol based on xDotGrid.NET framework is fully compared with the Globus GridFTP protocol. We finally propose the xDFS protoc...
متن کاملA Metadata Workload Generator for Data-Intensive File Systems
Large-scale data-intensive computing [2, 3] has posed numerous challenges to the underlying distributed file system, due to the unprecedented amount of data, the large number of users, the intense competition on cost and service quality, and the emergence of new applications. As a result, there has been an increasing amount of research on scalable metadata management [4, 6], high availability [...
متن کامل